Abstract

Product Distribution (PD) theory is a new framework for 
controlling Multi-Agent Systems (MAS’s). First we review 
one motivation of PD theory, as the information-theoretic 
extension of conventional full-rationality game theory to the 
case of bounded rational agents. In this extension the
equilibrium of the game is the optimizer of a Lagrangian of the
(probability distribution of) the joint state of the agents.
Accordingly we can consider a team game in which the shared
utility is a performance measure of the behavior of the MAS. 
For such a scenario the game is at equilibrium — the La- 
grangian is optimized — when the joint distribution of the 
agents optimizes the system’s expected performance. One 
common way to find that equilibrium is to have each agent
run a reinforcement learning algorithm. Here we investigate
the alternative of exploiting PD theory to run gradient
descent on the Lagrangian. We present computer experiments
validating some of the predictions of PD theory for
how best to do that gradient descent. We also demonstrate 
how PD theory can improve performance even when we are 
not allowed to rerun the MAS from different initial condi- 
tions, a requirement implicit in some previous work. 


1. Introduction 

Product Distribution (PD) theory is a recently intro- 
duced broad framework for analyzing, controlling, and op- 
timizing distributed systems [9, 10, 11]. Among its po- 
tential applications are adaptive, distributed control of a 
Multi-Agent System (MAS), (constrained) optimization, 
sampling of high-dimensional probability densities (i.e.,
improvements to Metropolis sampling), density estimation,
numerical integration, reinforcement learning, information- 
theoretic bounded rational game theory, population biology, 
and management theory. Some of these are investigated in 
[2, 1, 8, 3]. 

Here we investigate PD theory’s use for adaptive, dis- 
tributed control of a MAS. Typically such control is done 
by having each agent run its own reinforcement learning
algorithm [4, 13, 14, 12]. In this approach the utility
function of each agent is based on the world utility $G(x)$
mapping the joint move of the agents, $x \in X$, to the
performance of the overall system. However in practice the agents
in a MAS are bounded rational. Moreover the equilibrium 
they reach will typically involve mixed strategies rather than 
pure strategies, i.e., they don’t settle on a single point x op- 
timizing G(x). This suggests formulating an approach that 
explicitly accounts for the bounded rational, mixed strategy 
character of the agents. 

Now in any game, bounded rational or otherwise, the
agents are independent, with each agent $i$ choosing its move
$x_i$ at any instant by sampling its probability distribution
(mixed strategy) at that instant, $q_i(x_i)$. Accordingly, the
distribution of the joint moves is a product distribution,
$P(x) = \prod_i q_i(x_i)$. In this representation of a MAS, all
coupling between the agents occurs indirectly; it is the separate
distributions of the agents $\{q_i\}$ that are statistically coupled,
while the actual moves of the agents are independent.

PD theory adopts this perspective to show that the
equilibrium of a MAS is the minimizer of a Lagrangian $\mathcal{L}(P)$,
derived using information theory, that quantifies the
expected value of $G$ for the joint distribution $P(x)$. From this
perspective, the update rules used by the agents in RL-based
systems for controlling MAS’s are just particular
(inefficient) ways of finding that minimizing distribution.

Various techniques for modifying the agents’ utilities 
from G to improve convergence have been used with such 
RL-based systems [13]. As illustrated below, by viewing the 
problem as one of Lagrangian-minimization, PD theory can 
be used to derive corrections to those techniques. In addition,
PD theory suggests novel ways to find the equilibrium, e.g.,
applying any of the powerful search techniques for continuous
variables, like gradient descent, to find the $P$ optimizing
$\mathcal{L}$. By casting the problem this way in terms of finding
an optimal $P$ rather than finding an optimal $x$, we can
exploit the power of search techniques for continuous variables
even when $X$ is a discrete, finite space. Moreover, typically
such search techniques can be run in a highly distributed
fashion, since $P$ is a product distribution. Finally,
the techniques for modifying agents’ utilities in RL-based 
systems often require rerunning the MAS from different ini- 
tial conditions, or equivalently exploiting knowledge of the 
functional form of G to evaluate how particular changes to 
the initial conditions would affect G. PD theory provides us 
with techniques for improving convergence even when we 
have no such knowledge, and when we cannot rerun the sys- 
tem. 

In the next section we review the game-theory motivation 
of PD theory. We then present details of our Lagrangian- 
minimization algorithm. We end with computer experi- 
ments involving both controlling a spin glass (a system of 
coupled binary variables) and controlling the agents in a 
variant of the bar problem [13]. These experiments validate
some of the predictions of PD theory on how best to
modify the agents’ utilities from $G$, and on how to improve
convergence when one cannot rerun the system.

2. Bounded Rational Game Theory 

In this section we motivate PD theory as the information- 
theoretic formulation of bounded rational game theory. 

2.1. Review of noncooperative game theory 

In noncooperative game theory one has a set of $N$ players.
Each player $i$ has its own set of allowed pure strategies.
A mixed strategy is a distribution $q_i(x_i)$ over player
$i$’s possible pure strategies. Each player $i$ also has a private
utility function $g_i$ that maps the pure strategies adopted by
all $N$ of the players into the real numbers. So given mixed
strategies of all the players, the expected utility of player $i$
is $E(g_i) = \int dx \prod_j q_j(x_j) g_i(x)$ (see footnote 1).
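For a finite game the integral above is a sum over joint moves. The following is a minimal sketch of that sum, assuming each utility is stored as a NumPy array indexed by the players' moves; the function name `expected_utility` and the two-player payoff table are illustrative and not taken from the paper.

```python
import numpy as np

def expected_utility(g, qs):
    """E(g) = sum_x prod_j q_j(x_j) g(x) for a finite game.

    g  : array of shape (n_1, ..., n_N), utility for every joint move
    qs : list of N 1-D probability vectors (the mixed strategies)
    """
    out = np.asarray(g, dtype=float)
    for q in qs:
        # Contract the leading axis with that player's mixed strategy.
        out = np.tensordot(np.asarray(q, dtype=float), out, axes=(0, 0))
    return float(out)

# Hypothetical 2x2 utility table for player 1 and two mixed strategies.
g1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
q1 = np.array([0.5, 0.5])
q2 = np.array([0.25, 0.75])
value = expected_utility(g1, [q1, q2])
```

Because the joint distribution is a product, the sum can be contracted one player at a time rather than enumerated over all joint moves.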

In a Nash equilibrium every player adopts the mixed
strategy that maximizes its expected utility, given the mixed
strategies of the other players. More formally,
$\forall i,\ q_i = \mathrm{argmax}_{q'_i} \int dx\, q'_i(x_i) \prod_{j \neq i} q_j(x_j)\, g_i(x)$.
Perhaps the major objection that has been raised to the Nash
equilibrium concept is its assumption of full rationality [5, 6].
This is the assumption that every player $i$ can both calculate what
the strategies $q_{j \neq i}$ will be and then calculate its associated
optimal distribution. In other words, it is the assumption
that every player will calculate the entire joint distribution
$q(x) = \prod_j q_j(x_j)$. If for no other reasons than computational
limitations of real humans, this assumption is essentially
untenable.


1 Throughout this paper, the integral sign is implicitly interpreted as ap- 
propriate, e.g., as Lebesgue integrals, point-sums, etc. 


2.2. Review of the maximum entropy principle 

Shannon was the first person to realize that based on any
of several separate sets of very simple desiderata, there is a
unique real-valued quantification of the amount of syntactic
information in a distribution $P(y)$. He showed that this
amount of information is (the negative of) the Shannon
entropy of that distribution, $S(P) = -\int dy\, P(y) \ln[P(y)]$.
So for example, the distribution with minimal information
is the one that doesn’t distinguish at all between the various
$y$, i.e., the uniform distribution. Conversely, the most
informative distribution is the one that specifies a single
possible $y$. Note that for a product distribution, entropy is
additive, i.e., $S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
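The additivity of entropy over a product distribution is easy to check numerically. The snippet below is a small illustration, assuming discrete distributions and natural logarithms; the particular probability vectors are made up for the example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p ln p of a discrete distribution (0 ln 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

q1 = np.array([0.2, 0.8])
q2 = np.array([0.5, 0.3, 0.2])
joint = np.outer(q1, q2)          # product distribution q1(y1) * q2(y2)

lhs = entropy(joint)              # entropy of the joint
rhs = entropy(q1) + entropy(q2)   # sum of the marginal entropies
```

The same helper also shows the two extremes mentioned above: a uniform distribution attains the largest entropy for its support size, and a distribution concentrated on a single point has entropy zero.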

Say we are given some incomplete prior knowledge about
a distribution $P(y)$. How should one estimate $P(y)$ based
on that prior knowledge? Shannon’s result tells us how to
do that in the most conservative way: have your estimate of
$P(y)$ contain the minimal amount of extra information
beyond that already contained in the prior knowledge about
$P(y)$. Intuitively, this can be viewed as a version of
Occam’s razor. This approach is called the maximum entropy
(maxent) principle. It has proven useful in domains
ranging from signal processing to supervised learning [7].

2.3. Maxent Lagrangians 

Much of the work on equilibrium concepts in game the- 
ory adopts the perspective of an external observer of a game. 
We are told something concerning the game, e.g., its utility 
functions, information sets, etc., and from that wish to pre- 
dict what joint strategy will be followed by real-world play- 
ers of the game. Say that in addition to such information, 
we are told the expected utilities of the players. What is our 
best estimate of the distribution q that generated those ex- 
pected utility values? By the maxent principle, it is the dis- 
tribution with maximal entropy, subject to those expectation 
values. 

To formalize this, for simplicity assume a finite number
of players and of possible strategies for each player.
To agree with the convention in other fields, from now on
we implicitly flip the sign of each $g_i$ so that the associated
player $i$ wants to minimize that function rather than
maximize it. Intuitively, this flipped $g_i(x)$ is the “cost” to player
$i$ when the joint strategy is $x$, though we will still use the
term “utility”.

Then for prior knowledge that the expected utilities of
the players are given by the set of values $\{\epsilon_i\}$, the
maxent estimate of the associated $q$ is given by the minimizer of
the Lagrangian

$\mathcal{L}(q) \equiv \sum_i \beta_i [E_q(g_i) - \epsilon_i] - S(q)$    (1)

$= \sum_i \beta_i \left[ \int dx \prod_j q_j(x_j) g_i(x) - \epsilon_i \right] - S(q)$    (2)

where the subscript on the expectation value indicates that
it is evaluated under distribution $q$, and the $\{\beta_i\}$ are “inverse
temperatures” implicitly set by the constraints on the
expected utilities.

Solving, we find that the mixed strategies minimizing the
Lagrangian are related to each other via

$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)}$    (3)

where the overall proportionality constant for each $i$ is set
by normalization, and $G \equiv \sum_i \beta_i g_i$ (see footnote 2
for the subscript $q_{(i)}$). In Eq. 3 the probability
of player $i$ choosing pure strategy $x_i$ depends on the
effect of that choice on the utilities of the other players. This
reflects the fact that our prior knowledge concerns all the
players equally.

If we wish to focus only on the behavior of player $i$, it is
appropriate to modify our prior knowledge. To see how to
do this, first consider the case of maximal prior knowledge,
in which we know the actual joint strategy of the players,
and therefore all of their expected costs. For this case,
trivially, the maxent principle says we should “estimate” $q$ as
that joint strategy (it being the $q$ with maximal entropy that
is consistent with our prior knowledge). The same
conclusion holds if our prior knowledge also includes the expected
cost of player $i$.

Modify this maximal set of prior knowledge by removing
from it specification of player $i$’s strategy. So our prior
knowledge is the mixed strategies of all players other than
$i$, together with player $i$’s expected cost. We can incorporate
prior knowledge of the other players’ mixed strategies
directly, without introducing Lagrange parameters. The
resultant maxent Lagrangian is

$\mathcal{L}_i(q_i) \equiv \beta_i [E(g_i) - \epsilon_i] - S_i(q_i)$

$= \beta_i \left[ \int dx \prod_j q_j(x_j) g_i(x) - \epsilon_i \right] - S_i(q_i)$

solved by a set of coupled Boltzmann distributions:

$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)}$.    (4)

Following Nash, we can use Brouwer’s fixed point theorem
to establish that for any non-negative values $\{\beta_i\}$, there must
exist at least one product distribution given by the product
of these Boltzmann distributions (one term in the product
for each $i$).
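To make the coupled Boltzmann distributions concrete, here is a minimal sketch for a two-player finite game with cost matrices `g1` and `g2`. The damped fixed-point iteration is an assumption chosen for illustration (the text only asserts that a solution exists, and such an iteration is not guaranteed to converge for every game); the function names and the example cost table are hypothetical.

```python
import numpy as np

def boltzmann(cond_cost, beta):
    """q(x) proportional to exp(-beta * cond_cost[x]), computed stably."""
    w = np.exp(-beta * (cond_cost - cond_cost.min()))
    return w / w.sum()

def solve_two_player(g1, g2, beta1, beta2, iters=500, damping=0.5):
    """Damped iteration on q_i(x_i) ~ exp(-beta_i E(g_i | x_i)).

    g1[x1, x2], g2[x1, x2] are the players' costs; this iteration scheme
    is an illustrative choice, not the paper's algorithm.
    """
    n1, n2 = g1.shape
    q1 = np.full(n1, 1.0 / n1)
    q2 = np.full(n2, 1.0 / n2)
    for _ in range(iters):
        new_q1 = boltzmann(g1 @ q2, beta1)   # E(g1 | x1) under q2
        new_q2 = boltzmann(q1 @ g2, beta2)   # E(g2 | x2) under q1
        q1 = (1 - damping) * q1 + damping * new_q1
        q2 = (1 - damping) * q2 + damping * new_q2
    return q1, q2

# Hypothetical team game: both players share the same cost table.
G = np.array([[0.0, 1.0],
              [1.0, 2.0]])
q1_cold, q2_cold = solve_two_player(G, G, beta1=50.0, beta2=50.0)
q1_hot, q2_hot = solve_two_player(G, G, beta1=0.0, beta2=0.0)
```

With inverse temperature zero each player stays uniform, while for large inverse temperature the strategies concentrate on the cost-minimizing move, which here is move 0 for both players.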

The first term in $\mathcal{L}_i$ is minimized by a perfectly
rational player. The second term is minimized by a perfectly


2 The subscript $q_{(i)}$ on the expectation value indicates that it is
evaluated according to the distribution $\prod_{j \neq i} q_j$.


irrational player, i.e., by a perfectly uniform mixed
strategy $q_i$. So $\beta_i$ in the maxent Lagrangian explicitly specifies
the balance between the rational and irrational behavior of
the player. In particular, for $\beta_i \to \infty$, by minimizing the
Lagrangians we recover the Nash equilibria of the game. More
formally, in that limit the set of $q$ that simultaneously
minimize the Lagrangians is the same as the set of delta
functions about the Nash equilibria of the game. The same is
true for Eq. 3.

Eq. 3 is just a special case of Eq. 4, where all players
share the same private utility, $G$. (Such games are known
as team games.) This relationship reflects the fact that for
this case, the difference between the maxent Lagrangian and
the one in Eq. 2 is independent of $q_i$. Due to this
relationship, our guarantee of the existence of a solution to the set
of maxent Lagrangians implies the existence of a solution
of the form Eq. 3. Typically players will be closer to
minimizing their expected cost than maximizing it. For prior
knowledge consistent with such a case, the $\{\beta_i\}$ are all
non-negative.

For each player $i$ define

$f_i(x, q_i(x_i)) \equiv \beta_i g_i(x) + \ln[q_i(x_i)]$.    (5)

Then the maxent Lagrangian for player $i$ is

$\mathcal{L}_i(q) = \int dx\, q(x) f_i(x, q_i(x_i))$.    (6)

Now in a bounded rational game every player sets its strat- 
egy to minimize its Lagrangian, given the strategies of the 
other players. In light of Eq. 6, this means that we inter- 
pret each player in a bounded rational game as being per- 
fectly rational for a utility that incorporates its computa- 
tional cost. To do so we simply need to expand the domain 
of “cost functions” to include probability values as well as 
joint moves. 
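As a quick check on Eqs. 5 and 6, the following worked step expands the integral. It is a sketch that assumes the entropy convention $S(q_i) = -\int dx_i\, q_i(x_i) \ln q_i(x_i)$ and that every $q_j$ is normalized, so that integrating out the other players' moves leaves a factor of one in the second term.

```latex
\begin{align*}
\mathcal{L}_i(q)
  &= \int dx\, q(x)\,\bigl[\beta_i g_i(x) + \ln q_i(x_i)\bigr] \\
  &= \beta_i \int dx\, q(x)\, g_i(x)
     + \int dx_i\, q_i(x_i) \ln q_i(x_i) \\
  &= \beta_i E(g_i) - S(q_i).
\end{align*}
```

This differs from a Lagrangian of the form $\beta_i [E(g_i) - \epsilon_i] - S(q_i)$ only by the constant $\beta_i \epsilon_i$, which does not depend on $q_i$, so both have the same minimizer: the Boltzmann form of Eq. 4.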

Often our prior knowledge will not consist of exact
specification of the expected costs of the players, even if that
knowledge arises from watching the players make their
moves. Such alternative kinds of prior knowledge are
addressed in [10, 11]. Those references also demonstrate the
extension of the formulation to allow multiple utility
functions of the players, and even variable numbers of players.
Also discussed there are semi-coordinate transformations,
under which, intuitively, the moves of the agents are
modified to set in binding contracts.

3. Optimizing the Lagrangian 

First we introduce the shorthand

$[G \mid x_i] \equiv E(G \mid x_i)$    (7)

$= \int dx'\, \delta(x'_i - x_i)\, G(x') \prod_{j \neq i} q_j(x'_j)$    (8)

where the delta function forces $x'_i = x_i$ in the usual way.
Now given any initial $q$, one may use gradient descent to
search for the $q$ optimizing $\mathcal{L}(q)$. Taking the appropriate
partial derivatives, the descent direction is given by

$\Delta q_i(x_i) = \frac{\delta \mathcal{L}}{\delta q_i(x_i)} = [G \mid x_i] + \beta^{-1} \log q_i(x_i) + C$    (9)

where $C$ is a constant set to preserve the norm of the
probability distribution after update, i.e., set to ensure that

$\int dx_i\, q_i(x_i) = \int dx_i\, (q_i(x_i) + \Delta q_i(x_i)) = 1$.    (10)

Evaluating, we find that

$C = -\frac{1}{\int dx_i\, 1} \int dx_i \left\{ [G \mid x_i] + \beta^{-1} \log q_i(x_i) \right\}$.    (11)

(Note that for finite $X$, those integrals are just sums.)
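For a finite move set, Eqs. 9 and 11 reduce to a few array operations. The sketch below assumes the conditional expectations $[G \mid x_i]$ are already available as a vector `cond_G` (one entry per move of agent $i$); the function name `descent_direction` is illustrative.

```python
import numpy as np

def descent_direction(cond_G, q_i, beta):
    """Delta q_i(x_i) = [G | x_i] + log(q_i(x_i)) / beta + C  (Eq. 9).

    C is minus the average of the other two terms over the moves (Eq. 11),
    so the entries of the returned vector sum to zero and an update along
    it keeps q_i normalized.
    """
    raw = np.asarray(cond_G, dtype=float) + np.log(q_i) / beta
    C = -raw.mean()
    return raw + C

# Hypothetical conditional expected costs for one agent with three moves.
cond = np.array([0.3, -1.2, 0.8])
beta = 2.0
q_start = np.array([0.2, 0.3, 0.5])
direction = descent_direction(cond, q_start, beta)

# At the Boltzmann distribution q ~ exp(-beta * cond) the direction vanishes.
q_boltz = np.exp(-beta * cond)
q_boltz /= q_boltz.sum()
at_fixed_point = descent_direction(cond, q_boltz, beta)
```

The second calculation illustrates that, with the other agents held fixed, the Boltzmann distribution is a stationary point of this update.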

To follow this gradient, we need an efficient scheme for
estimation of the conditional expected $G$ for different $x_i$.
Here we do this via Monte Carlo sampling, i.e., by repeatedly
IID sampling $q$ and recording the resultant private utility
values. After using those samples to form an estimate of
the gradient for each agent, we update the agents’
distributions accordingly. We then start another block of IID
sampling to generate estimates of the next gradients.
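One block of this Monte Carlo scheme can be sketched as follows, assuming every agent has the same number `M` of moves indexed 0 to M-1 and that the distributions are stored as the rows of an array. The helper names `sample_block` and `conditional_means` and the toy utility (the sum of the moves) are illustrative assumptions.

```python
import numpy as np

def sample_block(qs, L, rng):
    """Draw L IID joint moves from the product distribution.

    qs : array (N, M), row i is agent i's distribution over M moves.
    Returns an integer array of shape (L, N).
    """
    N, M = qs.shape
    return np.stack([rng.choice(M, size=L, p=qs[i]) for i in range(N)], axis=1)

def conditional_means(samples, values, M):
    """Estimate [G | x_i = m] by averaging G over samples with x_i = m.

    Entries with no matching sample are left as NaN; counts[i, m] is the
    number of times agent i played move m in the block.
    """
    L, N = samples.shape
    est = np.full((N, M), np.nan)
    counts = np.zeros((N, M), dtype=int)
    for i in range(N):
        for m in range(M):
            mask = samples[:, i] == m
            counts[i, m] = mask.sum()
            if counts[i, m] > 0:
                est[i, m] = values[mask].mean()
    return est, counts

rng = np.random.default_rng(0)
qs = np.full((3, 2), 0.5)                     # three agents, two moves each
samples = sample_block(qs, 20000, rng)
values = samples.sum(axis=1).astype(float)    # toy world utility G(x) = sum_i x_i
est, counts = conditional_means(samples, values, M=2)
```

For this toy utility the exact conditional expectation is the agent's own move plus one, so the estimates can be compared against a known answer.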

In large systems, the sampling within any given block
can be slow to converge. One way to speed up the
convergence is to replace each $[G \mid x_i]$ with $[g_i \mid x_i]$, where $g_i$ is
set to minimize bias plus variance [9, 11]:

$g_i(x) := G(x) - \int dx'_i \frac{L^{-1}_{x'_i}}{\int dx''_i\, L^{-1}_{x''_i}} G(x_{-i}, x'_i)$,    (12)

where $L_{x_i}$ is the number of times the particular value $x_i$
arose in the most recent block of $L$ samples. This is called
the Aristocrat Utility (AU). It is a correction to the
utility of the same name investigated in [13] and references
therein.

Note that evaluation of AU for agent $i$ requires knowing
$G$ for all possible $x_i$, with $x_{-i}$ held fixed. Accordingly,
we consider an approximation, which is called the
Wonderful Life utility (WLU), in which we replace the values
$\frac{L^{-1}_{x_i}}{\int dx'_i\, L^{-1}_{x'_i}}$ defining agent $i$’s AU with a delta function
about the least likely (according to $q_i$) of that agent’s moves.
(This is a version of the utility of the same name investigated
in [13] and references therein.)
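The two private utilities can be sketched for a finite move set as below. This is an illustration under stated assumptions: moves are indexed 0 to M-1, `G` is any callable on a joint-move array, `counts_i[m]` plays the role of $L_{x_i}$ and is strictly positive, and the function names are hypothetical.

```python
import numpy as np

def aristocrat_utility(G, x, i, counts_i):
    """AU of Eq. 12: G(x) minus a weighted average of G over agent i's moves,
    with weights proportional to 1 / L_{x_i}."""
    counts_i = np.asarray(counts_i, dtype=float)
    weights = (1.0 / counts_i) / np.sum(1.0 / counts_i)
    x = np.asarray(x)
    baseline = 0.0
    for move, w in enumerate(weights):
        y = x.copy()
        y[i] = move                  # same x_{-i}, agent i's move replaced
        baseline += w * G(y)
    return G(x) - baseline

def wonderful_life_utility(G, x, i, q_i):
    """WLU: the weights are replaced by a delta function on agent i's
    least likely move under q_i."""
    x = np.asarray(x)
    y = x.copy()
    y[i] = int(np.argmin(q_i))
    return G(x) - G(y)

# Toy world utility: the sum of the moves.
toy_G = lambda x: float(np.sum(x))
au = aristocrat_utility(toy_G, [1, 0, 1], i=0, counts_i=[1, 3])
wlu = wonderful_life_utility(toy_G, [1, 0, 1], i=0, q_i=[0.1, 0.9])
```

AU needs one evaluation of `G` per move of agent $i$, whereas WLU needs only one extra evaluation, which is the practical motivation for the approximation.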

Below we present computer experiments validating the 
theoretical predictions that AU converges faster than the 
team game, and that the WLU defined here converges faster
than its reverse, in which the delta function is centered on 
the most likely of the agent’s moves. 


Both WLU and AU require recording not just G(x) for 
the Monte Carlo sample x, but also G at a set of points re- 
lated in a particular way to x. When the functional form of 
G is known, often there is cancellation of terms that obvi- 
ates this need. Indeed, often what one does need to record 
in these cases is easier to evaluate than is $G$. However when
the functional form of G is not known, using such private 
utilities would require rerunning the system, i.e., evaluat- 
ing G for many points besides x. 

PD theory provides us an alternative way to improve the
convergence of the sampling. This alternative exploits the 
fact that the joint distribution of all the agents is a prod- 
uct distribution. Accordingly, we can have the agents all an- 
nounce their separate distributions {qi} at the end of each 
block. By itself, this is of no help. However say that x as 
well as G(x) is recorded for all the samples taken so far 
(not just those in the preceding block). We can use this
information as a training set for a supervised learning
algorithm that estimates $G$. Again, this piece of information is
of no use by itself. But if we combine it with the announced
$\{q_i\}$, we can form an estimate of each $[G \mid x_i]$.

This estimate is in addition to the estimate based on the 
Monte Carlo samples — here the Monte Carlo samples from 
all blocks are used to approximate $G(x)$, rather than to
directly estimate the various $[G \mid x_i]$. Accordingly we can
combine these two estimates. Below we present computer 
experiments validating this technique. 

4. Experiments 

4.1. Known world utilities 

We first consider the case where the functional form of
the world utility is known. Technically, the specific problem
that we consider is the equilibration of a spin glass in an
external field, where each spin has a total angular momentum
3/2. The problem consists of 50 spins in a circular
formation, where every spin is interacting with three spins on its
right, three spins on its left as well as with itself. There are
also external fields which interact with each individual spin
differently. The world utility is thus of the following form:

$G(x) = \sum_i h_i x_i + \sum_{\langle i,j \rangle} J_{ij} x_i x_j$    (13)

where $\sum_{\langle i,j \rangle}$ means summing over all the interacting pairs
once. In our problem, the elements in the sets $\{h_i\}$ and
$\{J_{ij}\}$ are generated uniformly at random from $-0.5$ to $0.5$.
The algorithm for the Lagrangian estimation goes as
follows:

1. Each spin is treated as an agent which possesses
a probability distribution on its set of actions:
$\{q_i(x_i) \mid x_i \in \sigma_i \equiv \{-1, 1\}\}$, which is initially set to
be uniform.


2. Each agent picks its choice of state according to the
probability distribution $L$ times sequentially, where
$L$ is the Monte Carlo block size. We denote the number
of times state $x_i$ is picked by agent $i$ by $L_{x_i}$. We require $L_{x_i}$
to be non-zero for all $x_i \in \sigma_i$, i.e., if some $L_{x_i} = 0$,
we randomly pick a sample $x'$ and set $x'_i = x_i$ so that
$L_{x_i} = 1$. This process is to ensure that we can get
conditional expected values $[G \mid x_i]$ for all $x_i \in \sigma_i$. It
should be noted though that it violates the assumptions
of IID sampling underpinning the derivation of the
private utilities minimizing bias plus variance.

3. The gradient for each individual component is
calculated based on the $L$ samples taken from the previous
step (cf. eq. 9), and gradient descents are performed
for all $i$ simultaneously. Since all probabilities must be
positive, for each component $i$, the magnitude of
descent is halved if $q_i(x_i)$ is no longer positive for some
$x_i$.


4. Repeat steps 2 and 3. 
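The four steps above can be sketched end to end for the team game. This is a minimal illustration under several assumptions that are not fixed by the text: spins take values in {-1, +1}; the Lagrangian is taken in the form $E(G) - \beta^{-1} S$ implied by Eq. 9; the update moves against the gradient of Eq. 9 (a sign-convention assumption); and the names `make_ring_glass`, `world_utility`, `lagrangian`, and `run` are hypothetical.

```python
import numpy as np

def make_ring_glass(n, reach, rng):
    """Fields h and couplings J on a ring; J[i, j] is set once per pair.
    d = 0 is the self-interaction. Requires 2 * reach < n."""
    h = rng.uniform(-0.5, 0.5, size=n)
    J = np.zeros((n, n))
    for i in range(n):
        for d in range(reach + 1):
            J[i, (i + d) % n] = rng.uniform(-0.5, 0.5)
    return h, J

def world_utility(X, h, J):
    """G(x) of Eq. 13 for each row of X (shape (L, n), entries +/-1)."""
    return X @ h + np.einsum('li,ij,lj->l', X, J, X)

def lagrangian(q, h, J, beta):
    """Exact E(G) - S / beta under the product distribution q (shape (n, 2))."""
    m = q[:, 1] - q[:, 0]                       # mean spin of each agent
    off = J - np.diag(np.diag(J))
    expected_G = h @ m + m @ off @ m + np.trace(J)   # x_i^2 = 1 on the diagonal
    entropy = -np.sum(q * np.log(q))
    return expected_G - entropy / beta

def run(h, J, beta, L=100, steps=20, step_size=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(h)
    moves = np.array([-1.0, 1.0])
    q = np.full((n, 2), 0.5)                    # step 1: uniform start
    history = []
    for _ in range(steps):
        # Step 2: one block of L joint samples (index 1 means spin +1).
        idx = (rng.random((L, n)) < q[:, 1]).astype(int)
        for i in range(n):
            for m in range(2):
                if not np.any(idx[:, i] == m):  # force every move to appear
                    idx[rng.integers(L), i] = m
        G_vals = world_utility(moves[idx], h, J)
        # Step 3: estimate [G | x_i] and update all agents from the same block.
        for i in range(n):
            cond = np.array([G_vals[idx[:, i] == m].mean() for m in range(2)])
            raw = cond + np.log(q[i]) / beta
            delta = raw - raw.mean()            # Eq. 9 with C from Eq. 11
            step = step_size
            new = q[i] - step * delta
            while np.any(new <= 0):             # halve until all entries positive
                step /= 2
                new = q[i] - step * delta
            q[i] = new
        history.append(lagrangian(q, h, J, beta))
    return q, history                           # step 4 is the outer loop
```

A call such as `run(*make_ring_glass(50, 3, np.random.default_rng(1)), beta=1 / 0.6)` would mimic the 50-spin setting with $\beta^{-1} = 0.6$; replacing the team-game estimate `cond` with an estimate based on AU or WLU changes only the line that computes it.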

In figure 1, we have shown a comparison of three different
ways of doing the descent direction estimation in step 3
above. Team game means that we use $[G \mid x_i]$ to get the
descent directions, weighted Aristocrat Utility corresponds
to using the formula in eq. 12 to get the descent directions,
and uniform Aristocrat Utility corresponds to simplifying
the functions $\{g_i\}$ to

$g_i(x) := G(x) - \frac{1}{|\sigma_i|} \sum_{x'_i \in \sigma_i} G(x_{-i}, x'_i)$.    (14)


In figure 1, we see that weighted AU outperforms
uniform AU except at $\beta^{-1} = 0.2$. This unexpected result at
$\beta^{-1} = 0.2$ may be due to the limitation on the size of $L$.
(Recall that we have required that $L_{x_i} \neq 0$, and if it ever
is zero, we randomly pick a sample $x'$ and set $x'_i = x_i$ so
that $L_{x_i} = 1$.) Hence, as shown in figure 3, the number
of cases with $L_{x_i} = 1$ is greater when $\beta^{-1} = 0.2$ than when
$\beta^{-1} = 0.6$. This demonstrates that at $\beta^{-1} = 0.2$, quite a
few redistributions of the samples are happening and hence
the size of $L$ has to be enlarged to get decent statistics.

The speculation is further strengthened by comparing
correct WLU (where the values $\frac{L^{-1}_{x_i}}{\int dx'_i\, L^{-1}_{x'_i}}$ defining agent $i$’s AU are
replaced with a delta function about the least likely
(according to $q_i$) of that agent’s moves) and incorrect WLU
(where the same quantities are replaced with a delta
function about the most likely (according to $q_i$) of that
agent’s moves) with different sample sizes $L$. As shown in
figures 4 and 5, the increase in sample size does amend the
problem caused by resampling.


5. Unknown world utilities 


We now consider the case where the explicit formula for 
the world utility is not known and hence the calculations for 
WLU, uniform AU and weighted AU are not possible. Re- 
call that for this case we require that each player not only 
submits her choices of actions during each Monte Carlo 
block, but her probability distribution as well. Although this 
brings a constant overhead to the transmission, this becomes 
negligible when L is large. 

The problem we consider here is a 100-agent 4-night bar
problem [14]. In this problem, each agent’s strategy set
consists of four elements: $\{1, 2, 3, 4\}$. The world utility is of the
form:

$G(x) = -50 \times \sum_{k=1}^{4} e^{-f_k(x)/6}$    (15)

where $f_k(x) = \sum_i \delta(x_i - k)$, i.e., $f_k(x)$ is the number of
agents attending the bar at night $k$. The precise algorithm is
as follows:


1. Each agent possesses a probability distribution on her
set of actions: $\{q_i(x_i) \mid x_i \in \sigma_i\}$, which is initially set
to be uniform.

2. Each agent picks its state according to the probability
distribution $L$ times sequentially, where $L$ is the
Monte Carlo block size, and submits these choices together
with her probability distribution
$\{q_i(x_i) \mid x_i \in \sigma_i\}$. Again, we require $L_{x_i}$ to
be non-zero for all $x_i \in \sigma_i$, i.e., if some $L_{x_i} = 0$,
we randomly pick a sample $x'$ and set $x'_i = x_i$ so that
$L_{x_i} = 1$.

3. Denote the set of samples in the Monte Carlo block of $L$ steps by
$S$. Each agent generates a set of artificial data points
according to the agents’ probability distributions; denote
those by $A_i$. Then we define the following quantity:

$G_{x_i} := \frac{1-\alpha}{|S|} \sum_{x' \in S} \delta(x_i - x'_i)\, G(x')$    (16)

$\quad + \frac{\alpha}{|A_i|} \sum_{x' \in A_i} \delta(x_i - x'_i)\, G_S(x')$    (17)

where $\alpha$ is a weighting parameter between 0 and 1 and
$G_S$ is defined by:

$G_S(x) := \frac{\sum_{x' \in S} G(x')\, d(x, x')}{\sum_{x' \in S} d(x, x')}$    (18)

where $d(\cdot, \cdot)$ is some appropriate metric. In the
present 100-agent 4-night bar problem,
$d(x, x') := e^{-2 \sum_{k=1}^{4} |f_k(x) - f_k(x')|}$, where the functions $\{f_k(\cdot)\}$
are as defined in eq. 15.

4. Each agent updates her probability distribution
according to the gradients calculated as in eq. 9 but with
$[G \mid x_i]$ replaced by $G_{x_i}$. Again, for each agent $i$, the
magnitude of descent is halved if $q_i(x_i)$ is no longer
positive for some $x_i$.

5. Repeat steps 2 to 4.
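The estimate used in step 3 can be sketched as follows for the bar problem. Several choices here are assumptions made for illustration rather than details confirmed by the text: $G_S$ is taken to be a kernel-weighted average of the recorded $G$ values with weight $d(x, x') = e^{-2 \sum_k |f_k(x) - f_k(x')|}$; each of the two terms is normalized by the number of points whose $i$-th component matches, so that each is a conditional average; and the names `attendance`, `kernel`, `G_S`, and `blended_estimate` are hypothetical.

```python
import numpy as np

def attendance(x, nights=4):
    """f_k(x): number of agents choosing night k, for k = 1..nights."""
    return np.bincount(np.asarray(x), minlength=nights + 1)[1:nights + 1]

def kernel(x, y, nights=4):
    """d(x, x') = exp(-2 * sum_k |f_k(x) - f_k(x')|)."""
    diff = np.abs(attendance(x, nights) - attendance(y, nights)).sum()
    return np.exp(-2.0 * diff)

def G_S(x, S, G_vals, nights=4):
    """Kernel-weighted average of the recorded G values (assumed form)."""
    w = np.array([kernel(x, s, nights) for s in S])
    return float(w @ G_vals / w.sum())

def blended_estimate(i, move, S, G_vals, A, alpha, nights=4):
    """(1 - alpha) * Monte Carlo average + alpha * average of G_S over the
    artificial points, both restricted to points with x_i == move."""
    true_part = G_vals[S[:, i] == move].mean()
    matching = A[A[:, i] == move]
    art_part = np.mean([G_S(a, S, G_vals, nights) for a in matching])
    return (1 - alpha) * true_part + alpha * art_part

# Tiny hypothetical data set: three recorded joint moves of two agents.
S = np.array([[1, 2], [1, 3], [2, 2]])
G_vals = np.array([1.0, 3.0, 5.0])
A = S.copy()                      # stand-in for artificially generated points
est = blended_estimate(0, 1, S, G_vals, A, alpha=0.5)
```

With `alpha = 0` the estimate falls back to the plain Monte Carlo average, which corresponds to the case of no data augmentation.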

Given that the utility function is reasonably smooth, it
is natural to expect that the estimates aided by artificial
data points will provide an improvement. And this is indeed
shown in figure 6 with varying $\alpha$. The same experiments
are also performed for the 50-spin model introduced in sec- 
tion 4.1 but with a new metric d(x, x ; ) = ~ x i)- 

The results are shown in figure 7.

An immediate improvement to the scheme above is to
realize that there is no need to restrict ourselves to data
available in that particular time step in calculating $G_{x_i}$ in
eq. 16. Namely, unlike step 3 above, we can accumulate
“true” data from previous steps in calculating $G_{x_i}$, which
should improve the accuracy of the estimation. To
illustrate this idea, at time step $t$, we define $S'$ to be the union
of the set of true samples drawn at time step $t$ and the set of true
samples drawn at time step $t - 1$, and we calculate $G_{x_i}$ as:

$G_{x_i} := \frac{1-\alpha}{|S'|} \sum_{x' \in S'} \delta(x_i - x'_i)\, G(x')$    (19)

$\quad + \frac{\alpha}{|A_i|} \sum_{x' \in A_i} \delta(x_i - x'_i)\, G_{S'}(x')$.    (20)

The results for the bar problem are shown in figure 8.

6. Conclusion 

Product Distribution (PD) theory is a recently introduced 
broad framework for analyzing, controlling, and optimizing 
distributed systems [9, 10, 11]. Here we investigate PD the- 
ory’s use for adaptive, distributed control of a MAS. Typi- 
cally such control is done by having each agent run its own 
reinforcement learning algorithm [4, 13, 14, 12]. 

In this approach the utility function of each agent is 
based on the world utility G(x) mapping the joint move of 
the agents, $x \in X$, to the performance of the overall system.
However in practice the agents in a MAS are bounded ratio- 
nal. Moreover the equilibrium they reach will typically in- 
volve mixed strategies rather than pure strategies, i.e., they 
don’t settle on a single point x optimizing G(x). This sug- 
gests formulating an approach that explicitly accounts for 
the bounded rational, mixed strategy character of the agents. 

PD theory directly addresses these issues by casting the 
control problem as one of minimizing a Lagrangian of the 
joint probability distribution of the agents. This allows the
equilibrium to be found using gradient descent techniques. 
In PD theory, such gradient descent can be done in a dis- 
tributed manner. 

We present experiments validating PD theory’s predic- 
tions for how to speed the convergence of that gradient
descent. We then present other experiments validating the use

Figure 1. Plots of $\beta^{-1}$ vs. the Lagrangian
for different utilities. The curves are gener- 
ated by plotting the Lagrangian at the 20th 
timestep, i.e., after 20 descents. The initial 
step sizes are set to be 0.2 times the gradi- 
ents. Also, L = 100 and a total of 40 simula- 
tions are performed. (Red dotted line: Team 
game, green dashed line: uniform AU, blue 
solid line: weighted AU.) 



Figure 2. The time series of Lagrangian along 
the 20 time steps (curves generated at $\beta^{-1} = 0.6$).
(Red dotted line: Team game, green
dashed line: uniform AU, blue solid line: 
weighted AU.) 

Figure 3. Plots of time step versus
$\sum_{i=1}^{50} \sum_{x_i \in \sigma_i} \delta(L_{x_i} - 1)$ at different
temperatures. (Red dotted line: $\beta^{-1} = 0.2$, blue
solid line: $\beta^{-1} = 0.6$.)

of PD theory to improve convergence even if one is not allowed
to rerun the system (an approach common in RL-
based schemes). These results demonstrate the power of PD 
theory for providing a principled way to control a MAS in 
a distributed manner. 

Figure 4. The time series of Lagrangian along
the 20 time steps with sample size $L = 100$
(curves generated at $\beta^{-1} = 0.2$). (Red dotted
line: correct WLU, blue solid line: incorrect 
WLU.) 



Figure 5. The time series of Lagrangian along 
the 20 time steps with sample size L = 200 
(curves generated at $\beta^{-1} = 0.2$). (Red dotted
line: correct WLU, blue solid line: incorrect 
WLU.) 




Figure 6. Plots of $\beta^{-1}$ vs. the Lagrangian with
various weighting parameters $\alpha$ for the bar
problem. The curves are generated by plotting
the Lagrangian at the 20th timestep. (Red
dotted line: no data augmentation, green
dashed line: data aug. with $\alpha = 0.3$, blue solid
line: data aug. with $\alpha = 0.5$, black dashed dotted
line: data aug. with $\alpha = 0.7$.)

Figure 7. Plots of $\beta^{-1}$ vs. the Lagrangian
for the 50-spin model. The curves are generated
by plotting the Lagrangian at the 20th
timestep. (Red dotted line: no data augmentation,
blue solid line: data aug. with $\alpha = 0.5$.)

Figure 8. Plots of $\beta^{-1}$ vs. the Lagrangian
with data accumulation (blue line) and without
data accumulation (red dotted line) for
the bar problem. The curves are generated
by plotting the free energies at the 10th
timestep. The initial step sizes are set to be
0.2 times the gradients. Also, the number of
true data $|S|$ is 20, the number of data stored
is 20 and the number of artificial data $|A_i|$ is
40. A total of 80 simulations are performed.